R Introduction
What is R?
R is a programming language used for statistical analysis and graphics. It is an implementation of S, a programming language originally developed at AT&T's Bell Laboratories; S-PLUS is the commercial implementation of the same language.
Why R?
R: Object-Oriented Programming
Unlike many other statistical software packages such as SAS and SPSS, R will not spit out a mountain of output on the screen.
Instead, R returns an object containing all the results. You, as a user, have the flexibility to choose which results to extract or report.
R: Functional Programming
This feature allows us to write faster yet more compact code. For example, a common theme in R programming is avoidance of explicit iteration. Unlike in many other statistical software packages, explicit loops are discouraged.
Instead, R provides functions that allow us to express iterative behavior implicitly.
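As a rough illustration of implicit iteration (a sketch, not taken from the original notes), the snippet below computes the squares of 1 through 5 three ways: with an explicit for loop, with sapply(), and with plain vectorized arithmetic. The names squares_loop, squares_apply and squares_vec are made up for this example.

```r
n <- 1:5

# Explicit loop: fill a pre-allocated vector one element at a time
squares_loop <- numeric(length(n))
for (i in seq_along(n)) {
  squares_loop[i] <- n[i]^2
}

# Implicit iteration: sapply() applies the function to each element
squares_apply <- sapply(n, function(k) k^2)

# Vectorized arithmetic: no loop or apply call is needed at all
squares_vec <- n^2

squares_vec
## [1]  1  4  9 16 25
```

All three give the same result; the vectorized form is usually the shortest to write and read.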
R: Polymorphic
R is also polymorphic, which means that a single function can be applied to different types of inputs (much more user friendly).
Such a function is called a generic function (if you are a C++ programmer, you have seen a similar concept in virtual functions).
Let's look at one example: plot().
No matter the purpose, we use the same function.
data<-c(1,2,3,4)
plot(data)
# Regression Analysis
par(mfrow=c(2,2),mar=c(2,4,2,2))
results<-lm(speed ~ dist,data=cars)
plot(results)
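To make the idea of a generic function more concrete, here is a minimal sketch of how a generic picks a method based on the class of its input. The generic describe() and its two methods are invented for this illustration; they are not part of R or of these notes, and plot() itself has many more methods than this.

```r
# A hypothetical generic: the function name 'describe' is invented here
describe <- function(obj) UseMethod("describe")

# Method used for numeric vectors
describe.numeric <- function(obj) {
  paste("numeric vector of length", length(obj))
}

# Method used for fitted linear models (class "lm")
describe.lm <- function(obj) {
  paste("linear model with", length(coef(obj)), "coefficients")
}

describe(c(1, 2, 3, 4))
## [1] "numeric vector of length 4"
describe(lm(speed ~ dist, data = cars))
## [1] "linear model with 2 coefficients"
```

The caller always writes describe(); R looks at the class of the argument and runs the matching method.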
R Interface is ugly!
Many students in this class are much more familiar with the Windows operating system and have never been exposed to programming before, so we will use RStudio, one of the free Graphical User Interfaces (GUIs) that have been developed for R.
Among its features: easy publishing of reproducible documents such as reports, interactive visualizations, presentations, and websites.
Initial Start
When you first (like very first time) open RStudio, you will see three panels.
Console
The console is where you type commands and have them immediately performed. Each line in the console starts with the prompt >. As its name suggests, this prompt is really a request: a request for a command.
Environment
The panel in the upper right contains your workspace (aka Environment).
For example, once the value 3 has been assigned to the object a, that object is listed here.
History
Up here there is an additional tab to see the history of the commands that you’ve previously entered.
Files
The files tab allows you to open code/script files within RStudio.
Plots
Any plots that you generate will show up in the panel in the lower right corner.
Help
To check the syntax of any function in R, type ? in front of the function name to pull up the help file.
For example here I typed ?mean to get the help file for the mean function. The help files are not always the most useful but are usually a good place to start.
Script File
The top left is your editor window, where you write code or scripts; the console is now at the bottom. I usually change it.
The picture above illustrates my preferred style in RStudio.
Most R users typically submit commands to R by typing in either the console or the editor panel, rather than clicking a mouse in a Graphical User Interface (GUI).
In this class, we will make extensive use of scripts. A script is nothing but a collection of commands and procedures that the coder performed to get to their results and conclusions.
There are at least two advantages of doing so:
This will always be our approach in this class!!!
File > New File > R Script. This will open a blank text document.
x = 5 # Assign the variable x a value of 5
x == 5 # Does x = 5? Notice the double ==
Highlight both lines of code and click the button marked “Run”. If everything is working correctly, the console should display TRUE.
Alternatively, press Cmd+Return (Mac OS X) or Ctrl+Enter (Linux or Windows).
At the top of the previous script (Task 1), add and expand on the following comments:
Follow the example given above.
Arithmetic
1 + 1 #add numbers
## [1] 2
8 - 4 #subtract them
## [1] 4
13/2 #divide
## [1] 6.5
4*pi #multiply (pi is a built-in constant in R)
## [1] 12.56637
2^10 #exponentiate
## [1] 1024
Logical comparisons will result in a value of TRUE or FALSE.
3 < 4
## [1] TRUE
3 > 4
## [1] FALSE
3 == 4
## [1] FALSE
3 != 4
## [1] TRUE
10 - 6 == 4
## [1] TRUE
# Notice the difference between single and double equal signs
Now try 3 = 4. What is the result here?
#R delimits strings with EITHER double or single quotes.
#There is only a very minimal difference
message1 <- 'Let us get to coding!'
message2 <- "Please get to coding!"
print(message1)
## [1] "Let us get to coding!"
print(message2)
## [1] "Please get to coding!"
We can also print the result(s) stored in our variables by simply running the variable name instead of print().
message1
## [1] "Let us get to coding!"
“.” and “_” are OK to be added to variable names, but no other symbols.
Your variable name must not start with a number or _ (2squared and _one are illegal).
A note for those of you who have programming experience: while R supports object-oriented programming, periods “.” do not have a special meaning in the language. For historical reasons, R programmers often use periods in place of underscores in variable names, but either works. Just be consistent to keep your code readable.
R is case sensitive. Capitalization of variable names matters.
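The naming rules above can be checked directly in R. The sketch below uses a hypothetical helper, is_valid_name(), which relies on the base function make.names(): that function rewrites any string that is not a syntactically valid name, so a name is valid when it comes back unchanged. The variables score and Score are likewise made up for the example.

```r
# Hypothetical helper: a name is syntactically valid if make.names()
# does not have to change it
is_valid_name <- function(name) make.names(name) == name

is_valid_name(c("foo.bar", "foo_bar", "2squared", "_one"))
## [1]  TRUE  TRUE FALSE FALSE

# Case sensitivity: these are two different variables
score <- 10
Score <- 20
score == Score
## [1] FALSE
```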
x <- 42
x / 2
# [1] 21
# redefine x
x <- x + 3
x
# [1] 45
#if we assign something else to x, the old value is deleted
x <- "Hokies!"
x
# [1] "Hokies!"
foo <- 3
bar <- 5
foo.bar <- foo + bar
foo.bar
# [1] 8
entry that stores the year you started at Virginia Tech.
current_t.
current_t and entry. Store this as diffs.
my_year. Now compute the difference between current_t and my_year. Assign the results to my_diffs.
To remove all variables in memory:
# ls() # List of all variables in memory
rm(list=ls())
R script (just after the document details).
R Data types
You have observed a few of the different data types in the earlier sections. Here, we will formally discuss them. Some of the most basic data types we will cover are:
You can check the type of data by using class().
x <- "Lyrics to Virginia Tech Fight Song!"
class(x)
## [1] "character"
x2 <- c("TRUE", "FALSE")
x2 <- as.logical(x2) #Declare the data type
x2
## [1] TRUE FALSE
class(x2)
## [1] "logical"
x <- 1:20
x %% 4 #x mod 4
## [1] 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0
x %% 4 == 0
## [1] FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE
## [13] FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE
class(x %% 4 == 0)
## [1] "logical"
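As a compact summary, class() can be applied to one example value of each common basic type. The particular values below are arbitrary and chosen only for illustration.

```r
class(3.14)      # double-precision number
## [1] "numeric"
class(5L)        # the L suffix makes an integer
## [1] "integer"
class("Hokies")  # text
## [1] "character"
class(TRUE)      # logical value
## [1] "logical"

# Conversion functions such as as.numeric() change the type
as.numeric("42") + 1
## [1] 43
```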
A vector is the most common and basic data type in R, and is pretty much the workhorse of R. A vector is characterized by a series of values, which can be either numbers or characters. We can assign a series of values to a vector using the c() function.
In short, vectors are most useful when we have a collection of data points.
Here c() stands for concatenate or combine.
(v <- c(1, 2, 3, 4))
## [1] 1 2 3 4
(v <- 1:4)
## [1] 1 2 3 4
(v <- seq(from = 0, to = 0.5, by = 0.1))
## [1] 0.0 0.1 0.2 0.3 0.4 0.5
#A vector can also contain characters:
(v_colors <- c("blue", "yellow", "light green") )
## [1] "blue" "yellow" "light green"
Notice that by encasing the beginning and end of the assignment lines in parentheses, we immediately print the stored values.
We are able to index (collect subsets of our variables) by using square brackets. Unlike Python, for example, R’s indexing begins at 1.
v_colors[2] # We are trying to extract the second element of the vector, v_colors
## [1] "yellow"
v_colors[c(1,3)] # We can use the concatenation function to get nonconsecutive elements. Here, we are trying to extract elements in positions 1 and 3.
## [1] "blue" "light green"
How would you extract elements 1:9, 15, 19, 20 and 21:30 in zz below?
set.seed(1234)
zz <- rnorm(100)
Answer:
sel <- zz[c(1:9, 15, 19, 20:30)] # 9 + 1 + 1 + 11 = 22 elements
length(sel)
## [1] 22
We can replace elements in specific positions. Below, we replace the second and third colors with red and purple.
(v_colors[2:3] <- c("red", "purple") )
## [1] "red" "purple"
Sometimes it might be more convenient to get rid of particular elements instead. For example, I might want to extract all but the first 5 elements of a vector, or all but the 15th element. We might find it easier to use a negative index here.
j <- c(-1,-2,-3)
x[j]
## [1] 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
# We could have done that in one go as well
x[-c(1:3)]
## [1] 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
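The two cases mentioned earlier, dropping the first 5 elements and dropping only the 15th, can be sketched as follows. The vector w is a made-up example, used so that x is left unchanged.

```r
w <- 101:120          # hypothetical example vector with 20 elements

w[-(1:5)]             # all but the first 5 elements
## [1] 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120

dropped <- w[-15]     # all but the 15th element (the value 115)
length(dropped)
## [1] 19
115 %in% dropped
## [1] FALSE
```

Note the parentheses in -(1:5): without them, -1:5 would mean the sequence from -1 to 5, which mixes negative and positive indices and is an error.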
Another common way to subset is by using a logical vector. TRUE will select the element with the same index, while FALSE will not. Typically, these logical vectors are not typed by hand, but are the output of other functions or logical tests such as:
x <- 100:110
x
## [1] 100 101 102 103 104 105 106 107 108 109 110
x > 105 # returns TRUE or FALSE for each element, depending on whether it meets the condition
## [1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
select <- x > 105
x[select]
## [1] 106 107 108 109 110
If we would like the elements that evaluate to FALSE instead, we can use the ! (NOT) operator:
x[!select]
## [1] 100 101 102 103 104 105
You can combine multiple tests using:
& (AND operator - both conditions are true) or
| (OR operator - at least one of the conditions is true)
We can test whether x is between 103 and 106 (inclusive):
x[x >= 103 & x <= 106]
## [1] 103 104 105 106
x is greater than 103 but (AND) less than or equal to 106
x[x <= 106 & x > 103] # order of subsetting does not matter here!
## [1] 104 105 106
x is greater than or equal to 106 or less than 103
x[x >= 106 | x < 103]
## [1] 100 101 102 106 107 108 109 110
Sometimes we will need to search for certain strings in a vector. With multiple conditions, it becomes tedious to chain the “OR” operator |. The %in% operator tests, for each element of a vector, whether it is found in a search vector:
animals <- c("mouse", "rat", "dog", "cat")
animals[animals == "cat" | animals == "rat"] # returns both rat and cat
## [1] "rat" "cat"
animals %in% c("rat", "cat", "dog", "duck", "goat")
## [1] FALSE TRUE TRUE TRUE
animals[animals %in% c("rat", "cat", "dog", "duck", "goat")]
## [1] "rat" "dog" "cat"
Let’s say that we want to record which color robe each of 3 patients is wearing. We can assign names to the vector of colors.
v_colors
## [1] "blue" "red" "purple"
names(v_colors) <- c("Thomas", "Liz", "Tucker")
v_colors
## Thomas Liz Tucker
## "blue" "red" "purple"
x <- c(1,2,3)
y <- c(4,5,6)
# component-wise addition
x+y
## [1] 5 7 9
# component-wise multiplication
x*y
## [1] 4 10 18
# What happens to the following
y^x # or y**x
## [1] 4 25 216
# Would this work?
c(1,2,3,4) + c(1,2)
## [1] 2 4 4 6
# Would this work?
c(1,2,3) + c(1,2)
## [1] 2 4 4
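Why the weird results?
R automatically repeats (recycles) the shorter vector to fill in the operation. When the longer length is not a multiple of the shorter one, as in the second example, R still returns a result but also issues a warning.
2*c(1,2,3)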
## [1] 2 4 6
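One way to see what recycling does is to perform it by hand with rep_len(), which repeats a vector until it reaches a given length. This is only a sketch of the idea; the names short and warn_text are invented here.

```r
short <- c(1, 2)

# Recycling made explicit: stretch the short vector to length 4 first
rep_len(short, 4)
## [1] 1 2 1 2
c(1, 2, 3, 4) + rep_len(short, 4)
## [1] 2 4 4 6

# Lengths 3 and 2 do not divide evenly, so R still recycles (1, 2, 1)
# but also issues a warning; here the warning is captured as text
warn_text <- tryCatch(c(1, 2, 3) + short,
                      warning = function(w) conditionMessage(w))
warn_text
```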
Create a new matrix
(matrix<-matrix(1:16, nrow = 4, byrow = TRUE))
## [,1] [,2] [,3] [,4]
## [1,] 1 2 3 4
## [2,] 5 6 7 8
## [3,] 9 10 11 12
## [4,] 13 14 15 16
Note that 1:4 means every number from 1 to 4. In the matrix() function:
The first argument is the collection of elements that R will arrange into the rows and columns of the matrix. Here, we use 1:16, which is a shortcut for c(1, 2, 3, 4, …, 16).
byrow indicates that the matrix is filled by the rows. If we want the matrix to be filled by the columns, we use byrow = FALSE.
nrow indicates that the matrix should have 4 rows.
Selection of matrix elements is similar to vectors, except that we have two dimensions over which to subset: rows and columns.
# matrix[r,c] #Standard form of the matrix.
matrix[1,2] #Extract element in the first row and second column
## [1] 2
#Extract the entire first and second columns
matrix[,1:2]
## [,1] [,2]
## [1,] 1 2
## [2,] 5 6
## [3,] 9 10
## [4,] 13 14
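For comparison, here is a short sketch of the column-wise fill (byrow = FALSE) and of selecting a whole row or column by leaving one index empty. The name m_col is made up for this example, which also avoids reusing matrix as a variable name.

```r
# m_col is a hypothetical name; byrow = FALSE (the default) fills by column
m_col <- matrix(1:16, nrow = 4, byrow = FALSE)
m_col
##      [,1] [,2] [,3] [,4]
## [1,]    1    5    9   13
## [2,]    2    6   10   14
## [3,]    3    7   11   15
## [4,]    4    8   12   16

m_col[1, ]        # entire first row
## [1]  1  5  9 13
m_col[, 2]        # entire second column
## [1] 5 6 7 8
```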
rownames(matrix) <- c("Yes", "No", "Perhaps", "Maybe")
colnames(matrix) <- c("Apple", "Pear", "Banana", "Grapes")
matrix
## Apple Pear Banana Grapes
## Yes 1 2 3 4
## No 5 6 7 8
## Perhaps 9 10 11 12
## Maybe 13 14 15 16
x <- c(1,2,3)
matrix<-matrix(1:4, byrow = TRUE, nrow = 2)
length(x)
## [1] 3
length(matrix)
## [1] 4
dim(matrix)
## [1] 2 2
dim(x)
## NULL
R doesn’t like vectors to have different types: c(TRUE, 1, "Frank") becomes c("TRUE", "1", "Frank"). But storing objects with different types is absolutely fundamental to data analysis. R has a different type of object besides a vector used to store data of different types side-by-side: a list:
c(TRUE, 1, "Frank")
## [1] "TRUE" "1" "Frank"
x <- list(TRUE, 1, "Frank")
Many different things, not necessarily of the same length, can be put together.
x <- list(c(1:5), c("a", "b","c"), c(TRUE, FALSE), c(5L, 6L))
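A brief sketch of how the pieces of a list can be pulled back out. The element names flag, number and label, and the variables person and mixed, are invented for this example.

```r
# The element names (flag, number, label) are invented for this sketch
person <- list(flag = TRUE, number = 1, label = "Frank")

class(person$flag)          # each element keeps its own type
## [1] "logical"
class(person[["label"]])    # [[ ]] extracts a single element
## [1] "character"

# Elements may also differ in length
mixed <- list(1:5, c("a", "b", "c"), c(TRUE, FALSE))
lengths(mixed)
## [1] 5 3 2
```

Unlike c(TRUE, 1, "Frank"), nothing here is converted to character: the list stores each element as it was given.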
You can look at R’s iris data frame by typing:
View(iris)
Data frame with Harry Potter characters
name <- c("Harry", "Ron", "Hermione", "Hagrid", "Voldemort")
height <- c(176, 175, 167, 230, 180)
gpa <- c(3.4, 2.8, 4.0, 2.2, 3.4)
df_students <- data.frame(name, height, gpa)
df_students
## name height gpa
## 1 Harry 176 3.4
## 2 Ron 175 2.8
## 3 Hermione 167 4.0
## 4 Hagrid 230 2.2
## 5 Voldemort 180 3.4
Alternative way of creating DF
df_students <- data.frame(name = c("Harry", "Ron", "Hermione", "Hagrid",
"Voldemort"),
height = c(176, 175, 167, 230, 180),
gpa = c(3.4, 2.8, 4.0, 2.2, 3.4))
df_students
## name height gpa
## 1 Harry 176 3.4
## 2 Ron 175 2.8
## 3 Hermione 167 4.0
## 4 Hagrid 230 2.2
## 5 Voldemort 180 3.4
df_students$good <- c(1, 1, 1, 1, 0)
df_students
## name height gpa good
## 1 Harry 176 3.4 1
## 2 Ron 175 2.8 1
## 3 Hermione 167 4.0 1
## 4 Hagrid 230 2.2 1
## 5 Voldemort 180 3.4 0
dim(df_students)
df_students[2, 3] #Ron's GPA
df_students$gpa[2] #Ron's GPA
df_students[5, ] #get row 5
df_students[3:5, ] #get rows 3-5
df_students[, 2] #get column 2 (height)
df_students$height #get column 2 (height)
df_students[, 1:3] #get columns 1-3
df_students[4, 2] <- 255 #reassign Hagrid's height
df_students$height[4] <- 255 #same thing as above
df_students
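The logical subsetting shown earlier for vectors also works on the rows of a data frame. The sketch below rebuilds a three-row version of the data under the hypothetical name students so that it runs on its own; stringsAsFactors = FALSE is set explicitly so the name column stays character in older R versions too.

```r
# Rebuild a small version of the data frame so this sketch runs on its own
students <- data.frame(name   = c("Harry", "Ron", "Hermione"),
                       height = c(176, 175, 167),
                       gpa    = c(3.4, 2.8, 4.0),
                       stringsAsFactors = FALSE)

# A logical test in the row position keeps only the matching rows
students[students$gpa > 3, ]
##       name height gpa
## 1    Harry    176 3.4
## 3 Hermione    167 4.0

# Combine a row filter with a column selection
students[students$gpa > 3, "name"]
## [1] "Harry"    "Hermione"
```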
Now that you are equipped with the basics, go ahead and take the following DataCamp course: R Intro on DataCamp. Your invitations should now be in your inbox.
You can use the
getwd()
command to obtain the current directory R is using.
It is good practice to set the working directory location to where the files and data are stored.
Windows
setwd("C:/users/[your user name]/Desktop/AAEC4984/")
# OR
setwd("C:\\users\\[your user name]\\Desktop\\AAEC4984\\")
# notice the double backslashes
Mac
setwd("~/Desktop/AAEC4984")
getwd()
dir()
R allows us to import several file types. I will discuss 3 that we are most likely to use in this course.
Text files:
textdata <- read.table("examples/hogsdata.txt", header = TRUE)
CSV files:
xlsx files (requires the openxlsx package):
xlsxdata <- openxlsx::read.xlsx("examples/hogsdata.xlsx", ... )
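A minimal, self-contained sketch of the CSV case. Because the course data files are not available here, it first writes a tiny made-up data frame (toy, with hypothetical columns year and price) to a temporary file and then reads it back with read.csv(); with a real file you would pass its path instead of csv_path.

```r
# A tiny made-up data set (values and column names are hypothetical)
toy <- data.frame(year  = c(2018, 2019, 2020),
                  price = c(45.9, 47.9, 43.2))

# Write it to a temporary CSV file, then read it back
csv_path <- tempfile(fileext = ".csv")
write.csv(toy, csv_path, row.names = FALSE)

csvdata <- read.csv(csv_path, header = TRUE)
str(csvdata)     # inspect the column names and types that were read
```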
Functions are “canned scripts” that automate more complicated sets of commands, including operations, assignments, etc. For the purpose of this course, we will use a lot of functions that are either built into base R (that is, they are predefined) or available through R packages (discussed below).
A function usually takes one or more inputs called arguments, and often (but not always) returns a value.
Consider for example, taking the average of a set of random numbers (x).
set.seed(124)
x <- rnorm(6) * 100
(round(x, digits=2)) # round function => 2dp
## [1] -138.51 3.83 -76.30 21.23 142.55 74.45
If we were to do this manually, we would:
sumx <- sum(x)
nx <- length(x)
meanx <- sumx/nx
Using R’s built-in mean function, we can do all three steps internally and cross-check against our manual calculations.
mean(x)
## [1] 4.542439
meanx == mean(x) # cross-check against the manual calculation
## [1] TRUE
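The three manual steps can also be wrapped into a function of our own. This is only a sketch: my_mean is a hypothetical name, and the comparison uses all.equal(), which allows for tiny floating-point differences and is therefore generally a safer check than == for computed numbers.

```r
# my_mean is a hypothetical name; it wraps the three manual steps
my_mean <- function(values) {
  total <- sum(values)      # step 1: add everything up
  n <- length(values)       # step 2: count the elements
  total / n                 # step 3: divide (the last value is returned)
}

set.seed(124)
x <- rnorm(6) * 100

# Compare our function with the built-in one, allowing a small tolerance
isTRUE(all.equal(my_mean(x), mean(x)))
## [1] TRUE
```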
Since R is an Open Source software program, thousands of people contribute to the software. They do this by writing commands (called functions) to make a particular analysis easier, or to make a graphic prettier.
When you download R, you get access to a lot of functions that we will use. However the other user-written packages we use for our analyses will make our lives much easier.
For example, though we can use the plot command for standard graphics, you will quickly see that we can get much better looking time graphs using the fpp2 package (which also uses another package called ggplot2).
To install the fpp2 package, we can use the command
install.packages("fpp2")
We will need to install a package only once in R.
Now that you have the fpp2 package installed, we can check to see if it is in use
search()
Lastly, in order to use the package, we will need to load the library
library(fpp2)
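Since a package only needs to be installed once, a script can check for it before installing. The sketch below uses a hypothetical helper, needs_install(), built on the base function requireNamespace(); the fpp2 lines are left commented out so that running the sketch never triggers a download.

```r
# Hypothetical helper: TRUE when a package is not yet installed
needs_install <- function(pkg) {
  !requireNamespace(pkg, quietly = TRUE)
}

needs_install("stats")   # stats ships with R, so nothing to install
## [1] FALSE

# The same check applied to fpp2 (commented out on purpose):
# if (needs_install("fpp2")) install.packages("fpp2")
# library(fpp2)
```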
The fpp2 package contains a number of useful datasets. One such data set is h02.
Use the help() function to get a description of this data. Try
help(h02)
Now let us create a nice plot of the h02 data
autoplot(h02)
Let us leave it there for now!
Comments
Whenever possible, use comments! Anything following the symbol # in an R script will not be run in R.
Comments are notes we leave ourselves so we know:
I promise that this will become useful when you come back to your code after an extended time. I cannot tell you the number of times I have had a moment of pure genius while coding and I spend hours on a different day trying to understand why I coded it like that or what I actually did.
For example, below are the types of comments that I always include in my programs.
You can also understand the following code without even knowing what exactly each line of command does because I tell you what they are!
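As one possible illustration (a made-up script, not the instructor's own header), a commented script might look like the sketch below. The title, the variable names prices and avg_price, and the numbers are all hypothetical.

```r
# ---------------------------------------------------------------
# Title:   Average price (hypothetical example script)
# Author:  Your name
# Date:    YYYY-MM-DD
# Purpose: Show how header and inline comments document a script
# ---------------------------------------------------------------

# Step 1: store the raw values (made-up numbers, for illustration only)
prices <- c(40, 50, 60)

# Step 2: compute the average price
avg_price <- mean(prices)

# Step 3: report the result, rounded to one decimal place
round(avg_price, 1)
## [1] 50
```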